Bellabeat manufactures high-tech, health-focused smart products. Its co-founder, the artist Urška Sršen, has helped develop beautifully designed technology that empowers and inspires women all over the world. Bellabeat collects data on women's activity, sleep, stress, and reproductive health. Founded in 2013, Bellabeat has grown quickly to position itself as a wellness tech company for women around the world.
By 2016, Bellabeat had opened offices around the world and launched multiple products. Bellabeat products are available on Bellabeat.com and other online retail shops.
The goal of this analysis is to identify market opportunities for growth and to provide high-level recommendations that help guide Bellabeat’s marketing strategy, based on trends in smart device usage.
The FitBit Fitness Tracker Dataset was recommended by Bellabeat’s Co-founder and Chief Creative Officer, Urška Sršen. According to its description, the dataset was generated by respondents to a survey distributed via Amazon Mechanical Turk between 03.12.2016 and 05.12.2016. Thirty eligible Fitbit users consented to the submission of personal tracker data, including minute-level output for physical activity, heart rate, and sleep monitoring.
The FitBit Fitness Tracker Dataset was pulled from Kaggle: https://www.kaggle.com/datasets/arashnic/fitbit
I’m doing this analysis in R because an R notebook is easy to share with colleagues, easy to reproduce, and easy to pull from a GitHub repository.
First off, I’ll be importing three (3) files from the dataset:
* dailyActivity_merged
* hourlyCalories_merged
* sleepDay_merged
library(readr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyr)
library(stringr)
daily_activity <- read_csv("dailyActivity_merged.csv")
## Rows: 940 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): ActivityDate
## dbl (14): Id, TotalSteps, TotalDistance, TrackerDistance, LoggedActivitiesDi...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(daily_activity)
## # A tibble: 6 × 15
## Id ActivityDate TotalSteps TotalDistance TrackerDistance
## <dbl> <chr> <dbl> <dbl> <dbl>
## 1 1503960366 4/12/2016 13162 8.5 8.5
## 2 1503960366 4/13/2016 10735 6.97 6.97
## 3 1503960366 4/14/2016 10460 6.74 6.74
## 4 1503960366 4/15/2016 9762 6.28 6.28
## 5 1503960366 4/16/2016 12669 8.16 8.16
## 6 1503960366 4/17/2016 9705 6.48 6.48
## # ℹ 10 more variables: LoggedActivitiesDistance <dbl>,
## # VeryActiveDistance <dbl>, ModeratelyActiveDistance <dbl>,
## # LightActiveDistance <dbl>, SedentaryActiveDistance <dbl>,
## # VeryActiveMinutes <dbl>, FairlyActiveMinutes <dbl>,
## # LightlyActiveMinutes <dbl>, SedentaryMinutes <dbl>, Calories <dbl>
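Because `Id` labels a participant rather than measuring anything, it can be convenient to read it as text and to parse the date right after import. The following is a minimal sketch in base R, assuming a small inline sample (the first two rows shown above, first four columns only) stands in for the CSV file; `sample_csv` and `sample_activity` are illustrative names, not part of the original analysis.

```r
# Inline stand-in for dailyActivity_merged.csv (two rows, four columns only).
sample_csv <- "Id,ActivityDate,TotalSteps,TotalDistance
1503960366,4/12/2016,13162,8.5
1503960366,4/13/2016,10735,6.97"

# Read Id as character so it is treated as a label, not a number.
sample_activity <- read.csv(
  text = sample_csv,
  colClasses = c(Id = "character"),
  stringsAsFactors = FALSE
)

# Parse the month/day/year strings into Date values.
sample_activity$ActivityDate <- as.Date(sample_activity$ActivityDate,
                                        format = "%m/%d/%Y")

# Inspect the resulting column types.
str(sample_activity)
```

Reading `Id` as text also keeps `summary()` from reporting a meaningless mean and quartiles for it.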
Next, I fetch summaries of the data and check for missing values.
summary(daily_activity)
## Id ActivityDate TotalSteps TotalDistance
## Min. :1.504e+09 Length:940 Min. : 0 Min. : 0.000
## 1st Qu.:2.320e+09 Class :character 1st Qu.: 3790 1st Qu.: 2.620
## Median :4.445e+09 Mode :character Median : 7406 Median : 5.245
## Mean :4.855e+09 Mean : 7638 Mean : 5.490
## 3rd Qu.:6.962e+09 3rd Qu.:10727 3rd Qu.: 7.713
## Max. :8.878e+09 Max. :36019 Max. :28.030
## TrackerDistance LoggedActivitiesDistance VeryActiveDistance
## Min. : 0.000 Min. :0.0000 Min. : 0.000
## 1st Qu.: 2.620 1st Qu.:0.0000 1st Qu.: 0.000
## Median : 5.245 Median :0.0000 Median : 0.210
## Mean : 5.475 Mean :0.1082 Mean : 1.503
## 3rd Qu.: 7.710 3rd Qu.:0.0000 3rd Qu.: 2.053
## Max. :28.030 Max. :4.9421 Max. :21.920
## ModeratelyActiveDistance LightActiveDistance SedentaryActiveDistance
## Min. :0.0000 Min. : 0.000 Min. :0.000000
## 1st Qu.:0.0000 1st Qu.: 1.945 1st Qu.:0.000000
## Median :0.2400 Median : 3.365 Median :0.000000
## Mean :0.5675 Mean : 3.341 Mean :0.001606
## 3rd Qu.:0.8000 3rd Qu.: 4.782 3rd Qu.:0.000000
## Max. :6.4800 Max. :10.710 Max. :0.110000
## VeryActiveMinutes FairlyActiveMinutes LightlyActiveMinutes SedentaryMinutes
## Min. : 0.00 Min. : 0.00 Min. : 0.0 Min. : 0.0
## 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.:127.0 1st Qu.: 729.8
## Median : 4.00 Median : 6.00 Median :199.0 Median :1057.5
## Mean : 21.16 Mean : 13.56 Mean :192.8 Mean : 991.2
## 3rd Qu.: 32.00 3rd Qu.: 19.00 3rd Qu.:264.0 3rd Qu.:1229.5
## Max. :210.00 Max. :143.00 Max. :518.0 Max. :1440.0
## Calories
## Min. : 0
## 1st Qu.:1828
## Median :2134
## Mean :2304
## 3rd Qu.:2793
## Max. :4900
dplyr::glimpse(daily_activity)
## Rows: 940
## Columns: 15
## $ Id <dbl> 1503960366, 1503960366, 1503960366, 150396036…
## $ ActivityDate <chr> "4/12/2016", "4/13/2016", "4/14/2016", "4/15/…
## $ TotalSteps <dbl> 13162, 10735, 10460, 9762, 12669, 9705, 13019…
## $ TotalDistance <dbl> 8.50, 6.97, 6.74, 6.28, 8.16, 6.48, 8.59, 9.8…
## $ TrackerDistance <dbl> 8.50, 6.97, 6.74, 6.28, 8.16, 6.48, 8.59, 9.8…
## $ LoggedActivitiesDistance <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ VeryActiveDistance <dbl> 1.88, 1.57, 2.44, 2.14, 2.71, 3.19, 3.25, 3.5…
## $ ModeratelyActiveDistance <dbl> 0.55, 0.69, 0.40, 1.26, 0.41, 0.78, 0.64, 1.3…
## $ LightActiveDistance <dbl> 6.06, 4.71, 3.91, 2.83, 5.04, 2.51, 4.71, 5.0…
## $ SedentaryActiveDistance <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ VeryActiveMinutes <dbl> 25, 21, 30, 29, 36, 38, 42, 50, 28, 19, 66, 4…
## $ FairlyActiveMinutes <dbl> 13, 19, 11, 34, 10, 20, 16, 31, 12, 8, 27, 21…
## $ LightlyActiveMinutes <dbl> 328, 217, 181, 209, 221, 164, 233, 264, 205, …
## $ SedentaryMinutes <dbl> 728, 776, 1218, 726, 773, 539, 1149, 775, 818…
## $ Calories <dbl> 1985, 1797, 1776, 1745, 1863, 1728, 1921, 203…
skimr::skim(daily_activity)
| Name | daily_activity |
| Number of rows | 940 |
| Number of columns | 15 |
| _______________________ | |
| Column type frequency: | |
| character | 1 |
| numeric | 14 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| ActivityDate | 0 | 1 | 8 | 9 | 0 | 31 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| Id | 0 | 1 | 4.855407e+09 | 2.424805e+09 | 1503960366 | 2.320127e+09 | 4.445115e+09 | 6.962181e+09 | 8.877689e+09 | ▇▅▃▅▅ |
| TotalSteps | 0 | 1 | 7.637910e+03 | 5.087150e+03 | 0 | 3.789750e+03 | 7.405500e+03 | 1.072700e+04 | 3.601900e+04 | ▇▇▁▁▁ |
| TotalDistance | 0 | 1 | 5.490000e+00 | 3.920000e+00 | 0 | 2.620000e+00 | 5.240000e+00 | 7.710000e+00 | 2.803000e+01 | ▇▆▁▁▁ |
| TrackerDistance | 0 | 1 | 5.480000e+00 | 3.910000e+00 | 0 | 2.620000e+00 | 5.240000e+00 | 7.710000e+00 | 2.803000e+01 | ▇▆▁▁▁ |
| LoggedActivitiesDistance | 0 | 1 | 1.100000e-01 | 6.200000e-01 | 0 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 4.940000e+00 | ▇▁▁▁▁ |
| VeryActiveDistance | 0 | 1 | 1.500000e+00 | 2.660000e+00 | 0 | 0.000000e+00 | 2.100000e-01 | 2.050000e+00 | 2.192000e+01 | ▇▁▁▁▁ |
| ModeratelyActiveDistance | 0 | 1 | 5.700000e-01 | 8.800000e-01 | 0 | 0.000000e+00 | 2.400000e-01 | 8.000000e-01 | 6.480000e+00 | ▇▁▁▁▁ |
| LightActiveDistance | 0 | 1 | 3.340000e+00 | 2.040000e+00 | 0 | 1.950000e+00 | 3.360000e+00 | 4.780000e+00 | 1.071000e+01 | ▆▇▆▁▁ |
| SedentaryActiveDistance | 0 | 1 | 0.000000e+00 | 1.000000e-02 | 0 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 1.100000e-01 | ▇▁▁▁▁ |
| VeryActiveMinutes | 0 | 1 | 2.116000e+01 | 3.284000e+01 | 0 | 0.000000e+00 | 4.000000e+00 | 3.200000e+01 | 2.100000e+02 | ▇▁▁▁▁ |
| FairlyActiveMinutes | 0 | 1 | 1.356000e+01 | 1.999000e+01 | 0 | 0.000000e+00 | 6.000000e+00 | 1.900000e+01 | 1.430000e+02 | ▇▁▁▁▁ |
| LightlyActiveMinutes | 0 | 1 | 1.928100e+02 | 1.091700e+02 | 0 | 1.270000e+02 | 1.990000e+02 | 2.640000e+02 | 5.180000e+02 | ▅▇▇▃▁ |
| SedentaryMinutes | 0 | 1 | 9.912100e+02 | 3.012700e+02 | 0 | 7.297500e+02 | 1.057500e+03 | 1.229500e+03 | 1.440000e+03 | ▁▁▇▅▇ |
| Calories | 0 | 1 | 2.303610e+03 | 7.181700e+02 | 0 | 1.828500e+03 | 2.134000e+03 | 2.793250e+03 | 4.900000e+03 | ▁▆▇▃▁ |
A quick inference from the summaries pulled here: the dailyActivity_merged data frame has 940 observations and 15 variables, made up of 1 character column and 14 numeric columns. The character column holds dates, so it should be converted to a date type. There are no missing values in the dailyActivity_merged data frame.
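The missing-value check can also be done column by column. Below is a minimal sketch in base R on a made-up data frame (`toy_activity` is illustrative and contains one deliberately missing value). It also counts zero-step days, on the assumption that a zero may mark a day the tracker was not worn; that interpretation is a guess, not something the dataset states.

```r
# Toy data frame with one deliberately missing value (illustrative only).
toy_activity <- data.frame(
  Id = c("1503960366", "1503960366", "1503960366"),
  TotalSteps = c(13162, NA, 0),
  Calories = c(1985, 1797, 0)
)

# Number of missing values in each column.
missing_per_column <- colSums(is.na(toy_activity))
print(missing_per_column)

# Single TRUE/FALSE answer for the whole data frame.
has_missing <- anyNA(toy_activity)

# Days recorded with zero steps (possibly non-wear days; an assumption).
zero_step_days <- sum(toy_activity$TotalSteps == 0, na.rm = TRUE)
print(zero_step_days)
```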
library(dplyr)
library(tidyr)
library(stringr)
daily_activity <- daily_activity %>%
mutate(new_activity_date = as.Date(ActivityDate, format = "%m/%d/%Y")) %>%
mutate(day_of_week = weekdays(new_activity_date)) %>%
mutate(People_ID = as.character(Id))
#View(daily_activity) confirms the new_activity_date contains the dates in Date format
# Or the class(daily_activity$new_activity_date) at the console returns Date as the format
# NOTICE here, day of the week column is also created
library(readr)
library(dplyr)
library(tidyr)
library(stringr)
hourly_calories <- read_csv("hourlyCalories_merged.csv")
## Rows: 22099 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): ActivityHour
## dbl (2): Id, Calories
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(hourly_calories) #fetching first 6 rows
## # A tibble: 6 × 3
## Id ActivityHour Calories
## <dbl> <chr> <dbl>
## 1 1503960366 4/12/2016 12:00:00 AM 81
## 2 1503960366 4/12/2016 1:00:00 AM 61
## 3 1503960366 4/12/2016 2:00:00 AM 59
## 4 1503960366 4/12/2016 3:00:00 AM 47
## 5 1503960366 4/12/2016 4:00:00 AM 48
## 6 1503960366 4/12/2016 5:00:00 AM 48
anyNA(hourly_calories) #checking for missing values
## [1] FALSE
summary(hourly_calories)
## Id ActivityHour Calories
## Min. :1.504e+09 Length:22099 Min. : 42.00
## 1st Qu.:2.320e+09 Class :character 1st Qu.: 63.00
## Median :4.445e+09 Mode :character Median : 83.00
## Mean :4.848e+09 Mean : 97.39
## 3rd Qu.:6.962e+09 3rd Qu.:108.00
## Max. :8.878e+09 Max. :948.00
The checks above show there are no missing values in the hourly_calories data frame (anyNA() returns FALSE). The summary also shows that the minimum calories burnt in an hour is 42 and the maximum is 948.
library(ggplot2)
library(dplyr)
library(tidyr)
library(corrr)
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
hourly_calories <- hourly_calories %>%
mutate(activity_date = as.Date(ActivityHour, format = "%m/%d/%Y"))%>%
mutate(day_of_week = weekdays(activity_date)) %>%
mutate(People_ID = as.character(Id))
hourly_calories
## # A tibble: 22,099 × 6
## Id ActivityHour Calories activity_date day_of_week People_ID
## <dbl> <chr> <dbl> <date> <chr> <chr>
## 1 1503960366 4/12/2016 12:00:00 AM 81 2016-04-12 Tuesday 15039603…
## 2 1503960366 4/12/2016 1:00:00 AM 61 2016-04-12 Tuesday 15039603…
## 3 1503960366 4/12/2016 2:00:00 AM 59 2016-04-12 Tuesday 15039603…
## 4 1503960366 4/12/2016 3:00:00 AM 47 2016-04-12 Tuesday 15039603…
## 5 1503960366 4/12/2016 4:00:00 AM 48 2016-04-12 Tuesday 15039603…
## 6 1503960366 4/12/2016 5:00:00 AM 48 2016-04-12 Tuesday 15039603…
## 7 1503960366 4/12/2016 6:00:00 AM 48 2016-04-12 Tuesday 15039603…
## 8 1503960366 4/12/2016 7:00:00 AM 47 2016-04-12 Tuesday 15039603…
## 9 1503960366 4/12/2016 8:00:00 AM 68 2016-04-12 Tuesday 15039603…
## 10 1503960366 4/12/2016 9:00:00 AM 141 2016-04-12 Tuesday 15039603…
## # ℹ 22,089 more rows
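The conversion above keeps only the date part of `ActivityHour`. If the hour of day is also of interest, the full timestamp can be parsed. This is a minimal sketch in base R, assuming the 12-hour format seen in the rows above; the vector `activity_hour` is a hand-typed sample, and parsing `%p` (AM/PM) depends on the session locale, so an English or C locale is assumed.

```r
# Sample timestamps in the same format as the ActivityHour column.
activity_hour <- c("4/12/2016 12:00:00 AM",
                   "4/12/2016 9:00:00 AM",
                   "4/12/2016 1:00:00 PM")

# Parse the 12-hour clock strings; a fixed time zone avoids daylight-saving gaps.
parsed <- as.POSIXct(activity_hour,
                     format = "%m/%d/%Y %I:%M:%S %p",
                     tz = "UTC")

# Extract the hour of day (0-23) as an integer.
hour_of_day <- as.integer(format(parsed, "%H"))
print(hour_of_day)
```

With an hour column in place, calories could be averaged by hour in the same way they are averaged by day of the week.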
library(readr)
library(dplyr)
library(tidyr)
library(stringr)
sleepday <- read_csv("sleepDay_merged.csv")
## Rows: 413 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): SleepDay
## dbl (4): Id, TotalSleepRecords, TotalMinutesAsleep, TotalTimeInBed
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(sleepday)
## # A tibble: 6 × 5
## Id SleepDay TotalSleepRecords TotalMinutesAsleep TotalTimeInBed
## <dbl> <chr> <dbl> <dbl> <dbl>
## 1 1503960366 4/12/2016 12:0… 1 327 346
## 2 1503960366 4/13/2016 12:0… 2 384 407
## 3 1503960366 4/15/2016 12:0… 1 412 442
## 4 1503960366 4/16/2016 12:0… 2 340 367
## 5 1503960366 4/17/2016 12:0… 1 700 712
## 6 1503960366 4/19/2016 12:0… 1 304 320
anyNA(sleepday) # returns FALSE -- no missing values
## [1] FALSE
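Alongside missing values, repeated rows are worth a quick look before averaging. The sketch below uses base R on a made-up data frame (`toy_sleep` is illustrative, with one row duplicated on purpose); it makes no claim about whether the real sleepDay_merged file contains duplicates.

```r
# Toy sleep records; the third row deliberately repeats the second.
toy_sleep <- data.frame(
  Id = c("1503960366", "1503960366", "1503960366"),
  SleepDay = c("4/12/2016", "4/13/2016", "4/13/2016"),
  TotalMinutesAsleep = c(327, 384, 384),
  TotalTimeInBed = c(346, 407, 407)
)

# Count rows that are exact copies of an earlier row.
n_duplicates <- sum(duplicated(toy_sleep))
print(n_duplicates)

# Keep only the first occurrence of each row.
deduplicated <- unique(toy_sleep)
```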
library(ggplot2)
library(dplyr)
library(tidyr)
library(corrr)
library(plotly)
hourly_calories <- hourly_calories %>%
mutate(activity_date = as.Date(ActivityHour, format = "%m/%d/%Y"))%>%
mutate(day_of_week = weekdays(activity_date)) %>%
mutate(People_ID = as.character(Id))
hourly_calories %>%
ggplot(aes(x = People_ID, y = Calories, color = day_of_week)) +
geom_col() + coord_flip() + theme_minimal() -> plot
ggplotly(plot)
library(ggplot2)
library(dplyr)
library(tidyr)
library(corrr)
library(plotly)
hourly_calories %>%
group_by(day_of_week) %>%
summarise(mean_in_calories = mean(Calories, na.rm = TRUE)) %>%
fashion()
## day_of_week mean_in_calories
## 1 Friday 97.78
## 2 Monday 97.05
## 3 Saturday 99.87
## 4 Sunday 94.34
## 5 Thursday 97.01
## 6 Tuesday 98.62
## 7 Wednesday 96.87
The mean calories burnt per hour are very similar across the days of the week (roughly 94 to 100), which suggests that participants wore their trackers consistently on every day of the week during the tracked period. Sunday has the lowest mean, which suggests participants rested a little more on Sundays.
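The day-of-week table above is sorted alphabetically, which makes weekly patterns harder to read. A minimal sketch of putting the days in calendar order follows, using base R and the locale-independent `%u` weekday number instead of `weekdays()`. The data frame `toy_calories` is illustrative: its first two rows come from the sample output shown earlier and the last two are made up.

```r
# Two real sample rows (2016-04-12) plus two made-up rows on other days.
toy_calories <- data.frame(
  activity_date = as.Date(c("2016-04-12", "2016-04-12",
                            "2016-04-17", "2016-04-18")),
  Calories = c(81, 61, 70, 90)
)

day_names <- c("Monday", "Tuesday", "Wednesday", "Thursday",
               "Friday", "Saturday", "Sunday")

# %u is the ISO weekday number (1 = Monday ... 7 = Sunday).
day_number <- as.integer(format(toy_calories$activity_date, "%u"))

# An ordered set of levels keeps summaries in calendar order.
toy_calories$day_of_week <- factor(day_names[day_number], levels = day_names)

# Mean calories per day; days with no rows come out as NA.
mean_by_day <- tapply(toy_calories$Calories, toy_calories$day_of_week, mean)
print(mean_by_day)
```

Using a factor with explicit levels also keeps plot legends and facets in Monday-to-Sunday order.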
library(ggplot2)
library(dplyr)
library(tidyr)
library(corrr)
library(plotly)
# Converting Id column to People_ID character format and creating new computed columns, day_of_week and activity_date
sleepday <- sleepday %>%
mutate(activity_date = as.Date(SleepDay, format = "%m/%d/%Y"))%>%
mutate(day_of_week = weekdays(activity_date)) %>%
mutate(People_ID = as.character(Id))
sleepday %>%
ggplot(aes(x = People_ID, y = TotalMinutesAsleep, color = day_of_week)) +
geom_col() + coord_flip() + theme_minimal() -> plot
ggplotly(plot)
library(ggplot2)
library(dplyr)
library(tidyr)
library(corrr)
library(plotly)
sleepday %>%
group_by(day_of_week) %>%
summarise(mean_TotalMinutesAsleep = mean(TotalMinutesAsleep, na.rm = TRUE)) %>%
fashion()
## day_of_week mean_TotalMinutesAsleep
## 1 Friday 405.42
## 2 Monday 418.83
## 3 Saturday 420.81
## 4 Sunday 452.75
## 5 Thursday 402.37
## 6 Tuesday 404.54
## 7 Wednesday 434.68
This is consistent with participants sleeping more on Sundays during the tracked period: the average minutes asleep is highest on Sunday (about 453 minutes).
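A group mean is easier to trust when the number of records behind it is visible, since sleep is not logged for every participant on every day. The sketch below reports the mean and the record count per day in base R. The data frame `toy_sleep_days` and its day labels are made up for illustration and are not results from the dataset.

```r
# Made-up sleep records with day labels (illustrative only).
toy_sleep_days <- data.frame(
  day_of_week = c("Sunday", "Sunday", "Monday", "Monday", "Monday"),
  TotalMinutesAsleep = c(700, 420, 327, 384, 412)
)

# Mean minutes asleep and number of records for each day.
mean_minutes <- tapply(toy_sleep_days$TotalMinutesAsleep,
                       toy_sleep_days$day_of_week, mean)
n_records <- tapply(toy_sleep_days$TotalMinutesAsleep,
                    toy_sleep_days$day_of_week, length)

sleep_summary <- data.frame(
  day_of_week = names(mean_minutes),
  mean_minutes = as.numeric(mean_minutes),
  n_records = as.integer(n_records)
)
print(sleep_summary)
```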
library(ggplot2)
library(dplyr)
library(tidyr)
library(corrr)
library(plotly)
daily_activity %>%
ggplot(aes(x = VeryActiveMinutes, y = Calories, color = day_of_week)) +
geom_jitter() + geom_abline() + theme_minimal() +
facet_wrap(~ People_ID) -> plot
ggplotly(plot)
This plot suggests a positive, roughly linear relationship between VeryActiveMinutes and Calories.
Calories burnt also appear to decline as sedentary minutes increase for each participant.
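The two relationships described above can also be put into numbers with a correlation coefficient. The following is a minimal sketch in base R, using only the first seven daily rows of one participant as printed in the `glimpse()` output earlier. Seven rows from a single person are far too few to draw conclusions, so this only shows the mechanics; `sample_days` is an illustrative name.

```r
# First seven daily rows of one participant, copied from the glimpse() output.
sample_days <- data.frame(
  VeryActiveMinutes = c(25, 21, 30, 29, 36, 38, 42),
  SedentaryMinutes  = c(728, 776, 1218, 726, 773, 539, 1149),
  Calories          = c(1985, 1797, 1776, 1745, 1863, 1728, 1921)
)

# Pearson correlation of each activity measure with calories burnt.
cor_active <- cor(sample_days$VeryActiveMinutes, sample_days$Calories)
cor_sedentary <- cor(sample_days$SedentaryMinutes, sample_days$Calories)

print(c(very_active = cor_active, sedentary = cor_sedentary))
```

On the full data, the same two calls on `daily_activity` would give one overall figure per relationship; values near 1 or -1 indicate a strong linear association and values near 0 a weak one.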
library(ggplot2)
library(dplyr)
library(tidyr)
library(plotly)
daily_activity %>%
ggplot(aes(x = TotalSteps, y = Calories, color = People_ID)) +
geom_point() + geom_smooth() + theme_minimal() +
facet_wrap(~ day_of_week) -> plot
ggplotly(plot)
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
Calories burnt increase as total steps increase.
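One way to express this trend as a rate is the slope of a straight-line fit, read as extra calories per additional step. The sketch below fits `lm()` in base R to the first seven daily rows of one participant from the `glimpse()` output earlier. It assumes a linear relationship for simplicity, whereas the plot above uses a loess smoother, and the sample is too small to be representative; `sample_steps` is an illustrative name.

```r
# First seven daily rows of one participant, copied from the glimpse() output.
sample_steps <- data.frame(
  TotalSteps = c(13162, 10735, 10460, 9762, 12669, 9705, 13019),
  Calories   = c(1985, 1797, 1776, 1745, 1863, 1728, 1921)
)

# Straight-line fit of calories on steps (a simplifying assumption).
fit <- lm(Calories ~ TotalSteps, data = sample_steps)

# Slope: estimated change in calories per additional step.
slope <- coef(fit)[["TotalSteps"]]
print(slope)
```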
Having been tested by 30 participants during the survey period, and given the results seen above, the FitBit Fitness Tracker is recommended as a benchmark for building and/or extending Bellabeat’s own fitness apps. Some cues to note are:
* Getting more customers to use the apps every day. Building friendly notifications and exercise tracking plans would go a long way here.
* Including reward systems in the apps to encourage customers to use all of Bellabeat’s apps is also recommended.
Thank You. Silas O. Bamidele